Project ID:Project_IDXXXXXXXXX
Report ID:Plant and Animal De novo assembly-2019-04-02

Project Name:Plant and Animal De novo assembly




Report Date:2019-04-02

项目负责:北京诺禾致源科技股份有限公司Denovo业务线

With the development of long-read sequencing technologies, single-molecule sequencing platforms(Pacific Biosciences) and nanopore-based platforms(Oxford Nanopore Technology) can sequence long DNA fragments and provide long reads that could be used in de novo whole-genome assembly. Using efficient algorithms, these methods could provide high-quality assemblies in terms of contiguity and completeness of repetitive regions.

With de novo sequencing, the genome map for a species could be generated, providing a valuable reference sequence for phylogenetic studies, analysis of species diversity, mapping of specific traits and genetic markers, and other genomics research.



1.1 PacBio SMRT Sequencing

Recently, with the development of high-throughput sequencing technology, the third generation sequencing platforms have become new tools of genome research. The main disadvantages of the second generation sequencing technology are, reads are too short and may match with many different regions of the genome and are not unique to any specific region of the sequence. However, the third generation sequencing technology, represented by PacBio, has effectively solved this problem. PacBio long-read sequencing enabled by SMRT Sequencing technology, harnesses the natural process of DNA replication and enables real-time observation of DNA synthesis. With this unique technology, we equip innovative scientists and deliver the results needed to drive genetic discovery.

SMRT Sequencing is built upon two key innovations: zero-mode waveguides (ZMWs) and phospholinked nucleotides. ZMWs allow light to illuminate only the bottom of a well in which a DNA polymerase/template complex is immobilized. Phospholinked nucleotides allow observation of the immobilized complex as the DNA polymerase produces a completely natural DNA strand.

The PacBio Sequel system with greater capacity, longer reads, lower cost and smaller footprint, has be widely used in de novo whole-genome assembly.



Working Flow:

Figure 1 Schematic diagram of experimental flow


2.1 DNA Quantification & Qualification

1) DNA degradation and contamination were examined on 1% agarose gels.

2) DNA purity was checked using the NanoPhotometer® spectrophotometer (IMPLEN, CA, USA).

3) DNA concentration was measured using Qubit® DNA Assay Kit in Qubit® 2.0 Flurometer (Life Technologies, CA, USA).

4) DNA size was examined on pulse field electrophoresis.



2.2 Illumina DNA Library preparation

Total amount of 1.0μg DNA per sample was used as input material for library preparations with NEBNext® DNA Library Prep Kit following manufacturer's recommendations and indices were added to each sample. The genomic DNA is randomly fragmented to size of 350bp, and then DNA fragments were end polished, A-tailed, and ligated with the NEBNext adapters of Illumina sequencing, and further PCR enriched with primers of P5 and P7 oligos. The PCR products as final construct of the libraries were purified (AMPure XP system), followed by quality control test that included size distribution by Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA), and molarity measurement using real-time PCR. The flowchart below depicts the workflow of library preparation.

Figure 2 Library construction workflow (Illumina)


2.3 Pacbio DNA Library preparation

As shown in the figure 4, the first step in the generation of a SMRTbell library is production of appropriately-sized double-stranded DNA fragments. These fragments can be generated by random shearing of DNA, or by amplification of target regions of interest. The SMRTbell library itself is produced by ligating universal hairpin adapters onto double-stranded DNA fragments. The hairpin dimers formed during this process are removed at the end of the protocol using a magnetic bead purification step with size-selective conditions. Adapter dimers are also efficiently removed using PacBio’s MagBead kit. The final step of the protocol is to remove failed ligation products through the use of exonucleases. After the exonuclease and AMPure PB purification steps, sequencing primer is annealed to the SMRTbell templates, followed by binding of the sequence polymerase to the annealed templates.

Figure 3 Library construction workflow (PacBio)


2.4 Sequencing

Pair-end sequencing was performed on Illumina® NovaSeq platform, with the read length of 150 bp at each end. PacBio SMRT sequencing was performed on PacBio Sequel platform.



Figure 4 Bioinformatics analysis workflow




4.1 Illumina Raw Data

The original fluorescence images obtained from high throughput sequencing platforms are transformed to short reads by base calling. These short reads (Raw data) are recorded in FASTQ format, which contains sequence information (reads) and corresponding sequencing quality information.Every read in FASTQ format is stored in four lines as follows:

   @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

   GCTCTTTGCCCTTCTCGTCGAAAATTGTCTCCTCATTCGAAACTTCTCTGT

   +

   @@CFFFDEHHHHFIJJJ@FHGIIIEHIIJBHHHIJJEGIIJJIGHIGHCCF

   Line 1 begins with a '@' character which is followed by a sequence identifier and an optional description.

   Line 2 shows the sequenced bases.

   Line 3 begins with a '+' character and is optionally followed by the same sequence identifier.

   Line 4 encodes the sequencing quality for each base in line 2, and contain the same number of characters as bases in line 2.

Table 1 Information of Illumina sequence identifiers
IdentifierMeaning
EAS139Unique instrument name
136Run ID
FC706VJFlowcell ID
2Flowcell lane number
2104Tile number within the flowcell lane
15343'x'-coordinate of the cluster within the tile
197393'y'-coordinate of the cluster within the tile
1Member of a pair, 1 or 2 (paired-end or mate-pair reads only)
YY if the read fails filter (read is bad), N otherwise
180 when none of the control bits are on, otherwise it is an even number
ATCACGIndex sequence



4.2 Illumina Data quality control



4.3 PacBio Raw Data Processing and Data Statistics

In real-time single molecule sequencing technology (SMRT), the original sequencing data will contain 0, 1 or more than 2 molecular template information. However, 0 or more than 2 molecular templates will cause great interference to the next step bioinformatics analysis. Thus, we use the software SMRTlink to filter and process original sequencing results, with minLength 0, minReadScore 0.8 as parameters.



4.4 PacBio Polymerase Read Statistics

Polymerase read is a sequence of nucleotides incorporated by the DNA polymerase while reading a template, such as a circular SMRTbell™ template in Sequel system, which are trimmed to the high quality region and include bases from adaptors, as well as potentially multiple passes around a circular template. Polymerase reads are most useful for quality control of the instrument run. Finally, we got the polymerase read statistical results as follows.

Table 3 Polymerase Read Statistics
SamplePolymerase Read Bases(G)Polymerase ReadsPolymerase Read Length(mean)Polymerase Read N50
C1210.01564687639112058
C1310.01564687639112058
C1410.01564687639112058
C1510.01564687639112058

The details for the sequencing data statistics are as follows:

(1) Sample:Samlpe name;

(2) Polymerase Read Bases(G):The total base pair of all polymerase read;

(3) Polymerase Reads:The number of high quality polymerase reads;

(4) Polymerase Read Length (mean):The mean trimmed read length of polymerase reads;

(5) Polymerase Read N50:50% of all polymerase reads are longer than this value.

The distribution of polymerase Read length is shown in Figure 5:

CNTRL1.IS
  • sample1
  • sample2
  • sample3
  • sample4

Polymerase read length distribution




4.5 PacBio Subreads Statistics

Each polymerase read is partitioned to form one or more subreads, which contain sequence from a single pass of a polymerase on a single strand of an insert within a SMRTbell™ template and no adapter sequences. The subreads contain the full set of quality values and kinetic measurements. One example of subreads statistics information is shown in the Table 4.

Table 4 Subread Statistics
SampleSubread Bases(G)SubeadsAverage Subread Length(mean)Subread N50
C1210.01564687639112058
C1310.01564687639112058
C1410.01564687639112058
C1510.01564687639112058

The details for the sequencing data statistics are as follows:

(1) Sample: Samlpe name;

(2) Subreads base(G): The number base pair of all subreads;

(3) Subreads number: The number of high quality subreads;

(4) Average subreads length: The Average read length of all subreads;

(5) N50: 50% of all subreads reads are longer than the value.

The distribution of subread length is shown in Figure 6:

CNTRL1.IS
  • sample1
  • sample2
  • sample3
  • sample4

Polymerase read length distribution




4.6 PacBio long-read de novo assembly

FALCON was used for assembly, here is the procedure of the algorithms.

1) The long reads from Pacbio platform was input for the first step of de novo assembly. According to the insertions, deletion and sequencing error rate of base pair, the long reads from Pacbio platform was perform error correction to obtain pre-assemble reads. Generally, after error correction, the pre-assemble reads can achieve accuracies up to 99.999%.

2) Through the overlap among PacBio long reads (Mean read length 10~15kb, longest reads >40kb), the error corrected subreads were performed by the consensus algorithm called Overlap-Layout-Consensus (DBG2OLC) to obtain consensus sequence.

3) The error correction of preceding assembly was performed by consensus–calling algorithm Quiver. And then, paired-end clean reads from Illumina platform were preformed further correction using Pilon. The strict error correction procedure could improve the accuracy of assembly result and offer high quality consensus sequence which will be used in the subsequent scaffolding.

Figure 7 PacBio Assembly Workflow


4.7 Error Correction of Assembly with Illumina Short-reads Data

Paired-end clean reads obtained from Illumina platform were aligned to the assembly using BWA. Post-processing error correction and conflict resolution of the assembly was performed using Pilon.



4.8 Assembly result

The final assembly came with a contig N50 of 41.7kb, scaffold N50 of 5.2 Mb which covered 2.52 Gb of the genome.

Table 5 Statistics of contig/scaffold assembly result
Sampletotal lengthtotal nummax lengthN50 lengthN50 numN90 lengthN90 num
contig47241821024613951448730988524155546381
scaffold4726645641293318405820498577101222865022

(1) Total length: The total length of contigs/scaffold in the assembly.

(2) Max length: The length of the longest contig in the assembly.

(3) Total number: The number of contigs in the assembly.

(4) N50 length/number: The length L of the contig for which 50% of all bases in the final contigs are of length greater than L.

(5) N90 length/number: The length L of the contig for which 90% of all bases in the final contigs are of length greater than L.

Note: The contig statistics only count contig after scaffolding which length is longer than 100 bp.



4.9 Genome content

Table 6 Statistics of genome content
Sample IDNumber(bp)% of genome
A25000000025
T25000000025
C25000000025
G25000000025
N00
GC50000000050
Total1000000000100

GC content of the genome without N.



4.10 CEGMA evaluation

CEGMA (Core Eukaryotic Genes Mapping Approach) using a highly reliable set of 248 conserved protein families that occur in a wide range of eukaryotic to evaluate the genome assembly.

Table 7 CEGMA result of genome
species#Prots(complete)%completeness(complete)#Prots(complete+partial)%completeness(complete+partial)
XX20582.6624096.77
XX20582.6624096.77
XX20582.6624096.77
XX20582.6624096.77

(1) Complete: core gene >70% number of 248 ultra-conserved CEGs present in genome;

(2) Complete+partial: core gene number of 248 ultra-conserved CEGs present in genome;

(3) #Prots: core gene number of 248 ultra-conserved CEGs present in genome;

(4) %completeness: percentage of 248 ultra-conserved CEGs.



4.11 BUSCO evaluation

BUSCO (Benchmarking Universal Single-Copy Orthologs) using single-copy orthologs gene database, with blastn、augustus and hmmer software to evaluate the integrity of genome assembly.

Table 8 BUSCO result of genome
SpeciesBUSCO notation assessment results
species1C: 91% [D: 8.0%], F: 3.7%, M: 4.7%, n: XX
species2C: 91% [D: 8.0%], F: 3.7%, M: 4.7%, n: XX
species3C: 91% [D: 8.0%], F: 3.7%, M: 4.7%, n: XX
species4C: 91% [D: 8.0%], F: 3.7%, M: 4.7%, n: XX

BUSCO notation assessment results:

C: Complete Single-Copy BUSCOs

D: Complete Duplicated BUSCOs

F: Fragmented BUSCOs

M: Missing BUSCOs

n: Total BUSCO groups searched



1.Eid, John, et al. Real-time DNA sequencing from single polymerase molecules. 2009, Science.

2.Lieberman-Aiden E, van Berkum N L, Williams L, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome[J]. science, 2009, 326(5950): 289-293.

3.Cock, P.J.A., Fields, C.J., Goto, N., Heuer, M.L., and Rice, P.M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research 38, 1767-1771.

4.Hansen, K.D., Brenner, S.E., and Dudoit, S. (2010). Biases in Illuminatranscriptome sequencing caused by random hexamer priming. Nucleic acids research 38, e131-e131.

5.Jiang, L., Schlesinger, F., Davis, C.A., Zhang, Y., Li, R., Salit, M., Gingeras, T.R., and Oliver, B.(2011). Synthetic spike-in standards for RNA-seq experiments. Genome research 21,1543-1551.

6.Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754-1760.